Top 20 most-used hashtags and their frequency
Code
# convert dataframe column to list
hashtags = df['hashtags'].to_list()
# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]
# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]
# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]
# count items on list
hashtags_count = pd.Series(hashtags).value_counts()
# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)
top_hashtags
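The split, flatten and count steps can also be sketched with the standard library alone. This is a minimal illustration, not the notebook's method: the sample `rows` and the helper `count_delimited` are hypothetical, and missing values are assumed to be any non-string entry (such as `None` or `NaN`).

```python
from collections import Counter

# Hypothetical sample of a pipe-delimited 'hashtags' column;
# None stands in for a missing value.
rows = ["libertad|ecuador", None, "ecuador", "libertad|ecuador|quito"]


def count_delimited(values, sep="|"):
    """Count items across delimited strings, skipping non-string (missing) values."""
    counter = Counter()
    for value in values:
        if not isinstance(value, str):
            continue  # skip None / NaN entries
        counter.update(value.split(sep))
    return counter


# Two most frequent items, in descending order of frequency
top = count_delimited(rows).most_common(2)
print(top)
```

`Counter.most_common(n)` plays the role of `value_counts()` followed by `nlargest(n)` in the cell above.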
# filter column from dataframe
users = df['mentioned_names'].to_list()
# remove nan items from list
users = [x for x in users if not pd.isna(x)]
# split items into a list based on a delimiter
users = [x.split('|') for x in users]
# flatten list of lists
users = [item for sublist in users for item in sublist]
# count items on list
users_count = pd.Series(users).value_counts()
# return first n rows in descending order
top_users = users_count.nlargest(20)
top_users
# plot the data using plotly
fig = px.line(df, x='date', y='like_count', title='Número de likes en el tiempo',
              template='plotly_white', hover_data=['text'])
# show the plot
fig.show()
Tokens
Top 20 most common tokens and their frequency
Code
# load the spacy model for Spanish
nlp = spacy.load("es_core_news_sm")
# load stop words for Spanish
STOP_WORDS = nlp.Defaults.stop_words

# Function to filter stop words
def filter_stopwords(text):
    # lower text
    doc = nlp(text.lower())
    # filter tokens
    tokens = [token.text for token in doc
              if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)

# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)
# count items on column
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]
token_counts
The 10 hours of the day with the most tweets published
Code
# extract hour from datetime column
df['hour'] = df['date'].dt.strftime('%H')
# count items on column
hours_count = df['hour'].value_counts()
# return first n rows in descending order
top_hours = hours_count.nlargest(10)
top_hours
Platforms from which content was published and their frequency
Code
df['source_name'].value_counts()
source_name
Twitter for iPhone 16202
Twitter Web App 7273
Twitter Web Client 165
Twitter for Android 47
Name: count, dtype: int64
Topics
Topic modeling technique using transformers and TF-IDF
Code
# visualize topics
topic_model.visualize_topics()
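To make the TF-IDF half of the technique concrete, here is a minimal sketch of a class-based TF-IDF weighting, in which each topic is treated as one large document. It assumes the form tf(t, c) * log(1 + A / f(t)), where A is the average number of words per topic and f(t) is the frequency of term t across all topics; whether the model above uses exactly this weighting is an assumption. The toy `topics` and the helper `c_tf_idf` are hypothetical, and the term frequency is left unnormalized for simplicity.

```python
import math
from collections import Counter

# Hypothetical toy topics: each topic is the concatenation of its documents' tokens.
topics = {
    0: "mujer derechos mujer igualdad".split(),
    1: "impuestos economia impuestos mercado".split(),
}


def c_tf_idf(topic_tokens):
    """Score each term per topic as tf(t, c) * log(1 + A / f(t))."""
    # tf(t, c): raw count of each term within each topic
    tf = {c: Counter(tokens) for c, tokens in topic_tokens.items()}
    # f(t): frequency of each term across all topics
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per topic
    avg_words = sum(total.values()) / len(topic_tokens)
    return {
        c: {t: n * math.log(1 + avg_words / total[t]) for t, n in counts.items()}
        for c, counts in tf.items()
    }


scores = c_tf_idf(topics)
# Highest-scoring term for each topic
top_terms = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_terms)
```

Terms that are frequent inside one topic but rare overall receive the highest scores, which is what makes them useful as topic labels.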
Topic reduction
Map of 10 topics in the content of the tweets
Code
# visualize topics
topic_model.visualize_topics()
Terms per topic
Code
topic_model.visualize_barchart(top_n_topics=11)
Topic analysis
Selection of topics that touch on gender issues
Code
# selection of topics
topics = [0]
keywords_list = []
for topic_ in topics:
    topic = topic_model.get_topic(topic_)
    keywords = [x[0] for x in topic]
    keywords_list.append(keywords)
# flatten list of lists
word_list = [item for sublist in keywords_list for item in sublist]
# use apply method with lambda function to filter rows
filtered_df = df[df['text_pre'].apply(lambda x: any(word in x for word in word_list))]
percentage = round(100 * len(filtered_df) / len(df), 2)
print(f"Del total de {len(df)} tweets de @MamelaFialloFlo, alrededor de {len(filtered_df)} hablan sobre temas de género, es decir, cerca del {percentage}%")
Of the 23687 tweets by @MamelaFialloFlo, 8110 are about these topics
About 34.24% of the tweets by @MamelaFialloFlo are about these topics
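The filter above tests `word in x` against a string, which is a substring check, so a keyword can also match inside a longer word. The following sketch contrasts that behaviour with whole-token matching. The keywords, texts and helper names are hypothetical and chosen only to show the difference; it makes no claim about how much this affects the figures reported here.

```python
# Hypothetical keywords and preprocessed texts, for illustration only.
keywords = ["genero", "mujer"]
texts = ["la mujer trabaja", "ideas generosas", "economia libre"]


def matches_substring(text, words):
    """Same test as `any(word in x for word in word_list)` applied to a string."""
    return any(word in text for word in words)


def matches_token(text, words):
    """Match only whole whitespace-separated tokens."""
    tokens = set(text.split())
    return any(word in tokens for word in words)


substring_hits = [t for t in texts if matches_substring(t, keywords)]
token_hits = [t for t in texts if matches_token(t, keywords)]
print(len(substring_hits), len(token_hits))
```

In this toy data, "generosas" is caught by the substring test because it contains "genero", while the token test keeps only the text with the exact word "mujer".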
Code
# drop rows with 0 values in either column (copy to avoid modifying a slice of df)
filtered_df = filtered_df[(filtered_df.like_count != 0) & (filtered_df.retweet_count != 0)].copy()
# add a new column with the mean of the two columns
filtered_df['impressions'] = (filtered_df['like_count'] + filtered_df['retweet_count']) / 2
# extract year from datetime column
filtered_df['year'] = filtered_df['date'].dt.year
# remove urls from the tweet text
p.set_options(p.OPT.URL)
filtered_df['tweet_text'] = filtered_df['text'].apply(lambda x: p.clean(x))
# Create scatter plot
fig = px.scatter(filtered_df, x='like_count', y='retweet_count', size='impressions',
                 color='year', hover_name='tweet_text')
# Update title and axis labels
fig.update_layout(
    title='Tweets talking about gender with most Likes and Retweets',
    xaxis_title='Number of Likes',
    yaxis_title='Number of Retweets')
fig.show()